Abstract

Conventional  paradigms  of  machine  learning  assume  all  the 
training  data  are  available  when  learning  starts.  However,  in 
lifelong  learning,  the  examples  are  observed  sequentially  as 
learning  unfolds,  and  the  learner  should  continually  explore 
the  world  and  reorganize  and  refine  the  internal  model  or 
knowledge of the world. This leads to a fundamental
challenge: how to balance long-term and short-term goals
and how to trade off between information gain and model
complexity? These questions boil down to “what objective
functions  can  best  guide  a  lifelong  learning  agent?”  Here  we 
develop  a  sequential  Bayesian  framework  for  lifelong 
learning,  build  a  taxonomy  of  lifelong-learning  paradigms, 
and  examine  information-theoretic  objective  functions  for 
each  paradigm,  with  an  emphasis  on  active  learning.  The 
objective  functions  can  provide  theoretical  criteria  for 
designing  algorithms  and  determining  effective  strategies 
for  selective  sampling,  representation  discovery,  knowledge 
transfer,  and  continual  update  over  a  lifetime  of  experience. 

1.  Introduction 

Lifelong  learning  involves  long-term  interactions  with  the 
environment.  In  this  setting,  a  number  of  learning 
processes  should  be  performed  continually.  These  include, 
among  others,  discovering  representations  from  raw 
sensory  data  and  transferring  knowledge  learned  on 
previous  tasks  to  improve  learning  on  the  current  task 
(Eaton  &  desJardins,  2011).  Thus,  lifelong  learning 
typically  requires  sequential,  online,  and  incremental 
updates. 

Here  we  focus  on  the  aspect  of  never-ending 
exploration  and  continuous  discovery  of  knowledge.  In  this 
regard, lifelong learning can be divided into passive and
active learning (Cohn et al., 1990; Zhang & Veenker,
1991a; Thrun & Moeller, 1992). In passive learning the
learner just observes the incoming data, while in active
learning the learner chooses what data to learn. Active
learning can be further divided into selective and creative
learning (Valiant, 1984; Zhang & Veenker, 1991b; Freund
et al., 1993). Selective learning subsamples the incoming
examples, while creative learning generates new examples
(Cohn et al., 1994; Zhang, 1994).


Lifelong  learning  also  involves  sequential  revision  and 
transfer  of  knowledge  across  samples,  tasks,  and  domains. 
In  terms  of  knowledge  acquisition,  the  model  revision 
typically  requires  restructuring  of  models  rather  than 
parameter  tuning  as  in  traditional  machine  learning  or 
neural  network  algorithms.  Combined  with  the  effects  of 
incremental  and  online  change  in  both  data  size  and  model 
complexity,  it  is  fundamentally  important  how  the  lifelong 
learner  should  control  the  model  complexity  and  data 
complexity  as  learning  unfolds  over  a  long  period  or 
lifetime  of  experience. 

We  ask  the  following  questions:  how  can  a  lifelong 
learner  maximize  information  gain  while  minimizing  its 
model  complexity  and  costs  for  revision  and  transfer  of 
knowledge about the world? What objective function can
best guide the lifelong learning process by making a
trade-off between long-term and short-term goals? In this paper
we  focus  on  information-theoretic  objective  functions  for 
lifelong  learning,  with  an  emphasis  on  active  learning,  and 
develop  a  taxonomy  of  lifelong  learning  paradigms  based 
on  the  learning  objectives. 

In  Section  2  we  give  a  Bayesian  framework  for  lifelong 
learning  based  on  the  perception-cycle  model  of  cognitive 
systems.  Section  3  describes  the  objective  functions  for 
lifelong  learning  with  passive  observations,  such  as  time 
series  prediction  and  target  tracking.  Section  4  describes 
the  objective  functions  for  active  lifelong  learning,  i.e. 
continual  learning  with  actions  on  the  environment  but 
without  rewards.  We  also  consider  the  measures  for  active 
constructive  learning.  In  Section  5  we  discuss  the  objective 
functions  for  lifelong  learning  with  explicit  rewards  from 
the environment. Section 6 concludes by discussing the
extension and further use of the framework and objective
functions.

2.  A  Framework  for  Lifelong  Learning 

Here we develop a general framework for lifelong learning
that unifies active learning and constructive learning as
well as passive observational learning over a lifetime. We
start by considering the information flow in the
perception-action cycle of an agent interacting with the
environment.

Figure 1: A lifelong learning system architecture with the
perception-action cycle

2.1  Action-Perception-Learning  Cycle 

Consider an agent situated in an environment (Figure 1).
The agent has a memory to model the lifelong history. We
denote the memory state at time $t$ by $m_t$. The agent
observes the environment, measures the sensory state
$s_t$ of the environment, and chooses an action $a_t$. The goal
of the learner is to learn about the environment and predict
the next world state $s_{t+1}$ as accurately as possible. The
ability to predict improves the performance of the learner
across a large variety of specific behaviors, and is hence
quite fundamental, increasing the success rate of many
tasks (Still, 2009). The perception-action cycle of the
learner is effective for continuous acquisition and
refinement of knowledge in a domain or across domains.
This paradigm can also be used for time series prediction
(Barber et al., 2011), target tracking, and robot motions
(Yi et al., 2012). We shall see objective functions for these
problems in Sections 3 and 4.

In a different problem setting, the agent is more
task-oriented. It has a specific goal, such as reaching a target
location or winning a game, and takes actions to achieve
the goal. For the actions $a_t$ taken at state $s_t$, the agent
receives rewards $r_t$ from the environment. In this case, the
objective is typically formulated to maximize the expected
reward $V(s_t)$. Markov decision problems (Sutton &
Barto, 1998) are a representative example of this class of
tasks. We shall see variants of objective functions for
solving these problems in Section 5.


2.2  Lifelong  Learning  as  Sequential  Bayesian  Inference 

In  lifelong  learning,  the  agent  starts  with  the  initial 
knowledge  base  and  continually  updates  it  as  it  collects 
more  data  by  observing  and  interacting  with  the  problem 
domain  or  task.  This  inductive  process  of  evidence-driven 
refinement  of  prior  knowledge  into  posterior  knowledge 
can  be  naturally  formulated  as  a  Bayesian  inference 
(Zhang et al., 2012).

The prior distribution of the memory state at time $t$ is
given as $P(m_t^-)$, where the minus sign in $m_t^-$ denotes the
memory state before observing the data. The agent collects
experience by acting on the environment by $a_t$ and sensing
its world state $s_t$. This action and perception provide the
data for computing the likelihood $P(s_t, a_t \mid m_t^-)$ of the current
model to get the posterior distribution of the memory state
$P(m_t \mid s_t, a_t)$. Formally, the update process can be written
as

$$
\begin{aligned}
P(m_t \mid s_t, a_t, m_t^-)
  &= \frac{P(m_t, s_t, a_t, m_t^-)}{P(s_t, a_t, m_t^-)} \\
  &= \frac{1}{P(s_t, a_t, m_t^-)} \, P(m_t \mid s_t, a_t) \, P(s_t, a_t \mid m_t^-) \, P(m_t^-) \\
  &\propto P(m_t \mid s_t, a_t) \, P(s_t, a_t \mid m_t^-) \, P(m_t^-) \\
  &= P(m_t \mid s_t) \, P(s_t \mid a_t) \, P(a_t \mid m_t^-) \, P(m_t^-)
\end{aligned}
$$

where we have used the conditional independence between
action and perception given the memory state.

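From the statistical computing point of view, a
sequential estimation of the memory states would be more
efficient. To this end, we formulate the lifelong learning
problem as a filtering problem, i.e. estimating the
distribution $P(m_t \mid s_{1:t})$ of memory states $m_t$ from the
lifelong observations $s_{1:t} = s_1 s_2 \ldots s_t$ up to time $t$. That is,
given the filtering distribution $P(m_{t-1} \mid s_{1:t-1})$ at time $t-1$, the
goal is to recursively estimate the filtering distribution
$P(m_t \mid s_{1:t})$ of time step $t$:

$$
\begin{aligned}
P(m_t \mid s_{1:t})
  &= \frac{P(m_t, s_{1:t})}{P(s_{1:t})}
   = \frac{P(m_t, s_t, s_{1:t-1})}{P(s_{1:t})} \\
  &\propto P(s_t \mid m_t) \, P(m_t \mid s_{1:t-1}) \\
  &= P(s_t \mid m_t) \sum_{m_{t-1}} P(m_t, m_{t-1} \mid s_{1:t-1}) \\
  &= P(s_t \mid m_t) \sum_{m_{t-1}} P(m_t \mid m_{t-1}) \, P(m_{t-1} \mid s_{1:t-1})
\end{aligned}
$$

If we let $\alpha(m_t) = P(m_t \mid s_{1:t})$, we now have a recursive
update equation, which holds up to a normalizing constant:

$$
\alpha(m_t) \propto P(s_t \mid m_t) \sum_{m_{t-1}} P(m_t \mid m_{t-1}) \, \alpha(m_{t-1})
$$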

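Taking into account the actions explicitly, the recursive
lifelong learning becomes:

$$
\begin{aligned}
\alpha(m_t)
  &\propto P(s_t \mid m_t) \sum_{m_{t-1}} P(m_t \mid m_{t-1}) \, \alpha(m_{t-1}) \\
  &= \sum_{a_t} P(s_t, a_t \mid m_t) \sum_{m_{t-1}} P(m_t \mid m_{t-1}) \, \alpha(m_{t-1}) \\
  &= \sum_{a_t} P(s_t \mid a_t) \, P(a_t \mid m_t) \sum_{m_{t-1}} P(m_t \mid m_{t-1}) \, \alpha(m_{t-1})
\end{aligned}
$$

We note that the factors $P(s_t \mid a_t)$, $P(a_t \mid m_t)$, $P(m_t \mid m_{t-1})$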
correspond  respectively  to  the  perception,  action,  and  the 
prediction  steps  in  Figure  1.  These  distributions  determine 
how  the  agent  interacts  with  the  environment  to  model  it 
and  attain  novel  information. 
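
As an illustration only, the recursive update can be sketched for a small discrete model. The sketch assumes finite sets of memory states, actions, and sensory states; the function names and all probability tables below are hypothetical and chosen purely to make the recursion concrete.

```python
import numpy as np

def perception_likelihood(p_s_given_a, p_a_given_m, s):
    """P(s_t | m_t) = sum_a P(s_t | a_t) P(a_t | m_t).

    p_s_given_a : shape (A, S), rows are P(s | a)
    p_a_given_m : shape (K, A), rows are P(a | m)
    s           : index of the observed sensory state
    """
    return p_a_given_m @ p_s_given_a[:, s]

def filter_step(alpha_prev, transition, likelihood):
    """One recursive update of alpha(m_t) = P(m_t | s_{1:t}).

    alpha_prev : shape (K,), alpha(m_{t-1})
    transition : shape (K, K), transition[i, j] = P(m_t = j | m_{t-1} = i)
    likelihood : shape (K,), P(s_t | m_t)
    """
    predicted = alpha_prev @ transition        # sum over m_{t-1}
    unnormalized = likelihood * predicted      # multiply by P(s_t | m_t)
    return unnormalized / unnormalized.sum()   # normalize to a distribution

# Hypothetical two-state example (all numbers are made up).
transition = np.array([[0.9, 0.1], [0.2, 0.8]])
p_a_given_m = np.array([[0.8, 0.2], [0.3, 0.7]])
p_s_given_a = np.array([[0.9, 0.1], [0.2, 0.8]])

alpha = np.array([0.5, 0.5])
history = []
for s in [0, 0, 1]:
    likelihood = perception_likelihood(p_s_given_a, p_a_given_m, s)
    alpha = filter_step(alpha, transition, likelihood)
    history.append(alpha)
```

Each step uses only the previous filtering distribution and the newest observation, so no past data need to be stored, which is the property that makes the formulation suitable for lifelong learning.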

2.3  Lifelong  Supervised  Learning 

The  perception-action  cycle  formulation  above  emphasizes 
the  sequential  nature  of  lifelong  learning.  However,  the 
nonsequential  learning  tasks,  such  as  classification  and 
regression,  can  also  be  incorporated  in  this  framework  as 
special  cases.  This  is  especially  true  for  concept  learning  in 
a dynamic environment (Zhang et al., 2012). In lifelong
learning of classification, the examples are observed as
$(x_t, y_t)$, $t = 1, 2, 3, \ldots$, but the examples are independent.
The goal is to approximate $y_t = f(x_t; m_t)$ with a minimum
loss $L(y_q, \hat{y}_q)$ for an arbitrary query input $x_q$. Note that by
substituting

$$
s_t := x_t, \qquad a_{t+1} := y_t = f(x_t; m_t)
$$

classification becomes a special case of the perception-action
formulation above. Likewise, the lifelong learning of
regression problems can be solved within this framework.
The only difference from the classification problem is that
in regression the outputs $y_t$ are real values instead of
categorical or discrete values.

3.  Learning  with  Observations 

3.1  Dynamical  Systems  and  Markov  Models 

Dynamical systems involve sequential prediction (Figure
2). For example, time series data consist of $s_{1:T} = s_1, \ldots, s_T$.
Since the time length $T$ can be very large or infinite, this
problem is an example of a lifelong learning problem. In
addition, the learner can observe many or indefinitely many
different time series, in which case each time series is
called an episode.


Figure  2:  Learning  with  observations 

Dynamical  systems  can  be  represented  by  Markov 
models  or  Markov  chains.  The  joint  distribution  of  the 
sequence  of  observations  can  be  expressed  as 

$$
P(s_{1:T}) = \prod_{t=1}^{T} P(s_t \mid s_{1:t-1}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})
$$

where  in  the  second  equality  we  used  the  Markov 
assumption,  i.e.  the  current  state  is  dependent  only  on  the 
one  previous  step. 
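
The factorization under the Markov assumption can be sketched for a discrete chain as follows. This is a minimal illustration, not part of the original formulation: the function name is hypothetical, and the factor for the first state is supplied by an assumed initial distribution.

```python
import numpy as np

def markov_log_likelihood(states, initial, transition):
    """log P(s_{1:T}) for a first-order Markov chain over discrete states.

    states     : sequence of integer state indices s_1, ..., s_T
    initial    : shape (S,), assumed distribution of the first state
    transition : shape (S, S), transition[i, j] = P(s_t = j | s_{t-1} = i)
    """
    logp = np.log(initial[states[0]])
    # Each later factor depends only on the immediately preceding state.
    for prev, cur in zip(states[:-1], states[1:]):
        logp += np.log(transition[prev, cur])
    return float(logp)

# Hypothetical two-state chain (numbers are made up).
initial = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1], [0.2, 0.8]])
logp = markov_log_likelihood([0, 0, 1], initial, transition)
```

Working in log space keeps the product numerically stable when $T$ is very large, as it is in the lifelong setting.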

In time series prediction, the learner has no control
over the observations; it just passively receives the
incoming data. The goal of the learner is to find the model
$m_t$ that best predicts the next state $s_{t+1} = f(s_t; m_t)$ given
the current state $s_t$. How do we define the word “best”
quantitatively? In the following subsections we examine
three measures: prediction error, predictive information,
and information bottleneck. The last two criteria are based
on information-theoretic measures.

3.2  Prediction  Error 

The accuracy of time series prediction can be measured by
the prediction error, i.e. the mean squared error (MSE)
between the predicted states $\hat{s}_{t+1}$ and the observed states
$s_{t+1}$:

$$
\mathrm{MSE}(s_{1:T}) = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( s_{t+1} - \hat{s}_{t+1} \right)^2
$$

where the prediction $\hat{s}_{t+1}$ is made by using the learner’s
current model, i.e. $\hat{s}_{t+1} = f(s_t; m_t)$, and $T$ is the length of
the series. Thus, a natural measure is the root of the MSE,
or RMSE, and the learner aims to minimize it:

$$
\min_{m} \mathrm{RMSE}(s_{1:T}) = \min_{m} \sqrt{\mathrm{MSE}(s_{1:T})}
$$

where $m \in M$ denotes the model parameters.
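
A minimal sketch of the one-step RMSE computation is given below. It assumes a scalar time series and a fixed predictor passed in as a plain function; the name `one_step_rmse` and the toy series are hypothetical.

```python
import numpy as np

def one_step_rmse(series, predict):
    """RMSE between observed s_{t+1} and predictions f(s_t) over a series.

    series  : sequence of T scalar observations s_1, ..., s_T
    predict : function mapping s_t to a prediction of s_{t+1}
    """
    series = np.asarray(series, dtype=float)
    predictions = np.array([predict(s) for s in series[:-1]])
    errors = series[1:] - predictions          # T - 1 one-step errors
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy doubling series: a doubling model is exact, a persistence model is not.
series = [1.0, 2.0, 4.0, 8.0]
rmse_doubling = one_step_rmse(series, lambda s: 2.0 * s)
rmse_persistence = one_step_rmse(series, lambda s: s)
```

In a lifelong setting the model would change over time, so `predict` would be replaced by the learner's current model at each step; the fixed predictor here only keeps the example short.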
3.3 Predictive Information


For the evaluation of a time series with an indefinite length,
predictive information (Bialek et al., 2001) has been
proposed. It is defined as the mutual information (MI)
between the future and the past, relative to some instant
$t$:

$$
I(S_{\mathrm{future}}; S_{\mathrm{past}}) =
\left\langle \log_2 \frac{P(s_{\mathrm{future}}, s_{\mathrm{past}})}{P(s_{\mathrm{future}}) \, P(s_{\mathrm{past}})} \right\rangle
$$

where the symbol $\langle \cdot \rangle$ denotes the expectation operator. If $S$ is a
Markov chain, the predictive information (PI) is given by
the MI between two successive time steps:

$$
I(S_{t+1}; S_t) =
\left\langle \log_2 \frac{P(s_{t+1}, s_t)}{P(s_{t+1}) \, P(s_t)} \right\rangle
$$

Several authors have studied this measure for
self-organized learning and adaptive behavior. Zahedi et al.
(2010), for example, found the principle of maximizing
predictive information effective for evolving coordinated
behavior in physically connected robots that start with
no knowledge of themselves or the world. Friston (2009) argues
that  self-organizing  biological  agents  resist  a  tendency 
to  disorder  and  therefore  minimize  the  entropy  of  their 
sensory  states.  He  proposes  that  the  brain  uses  the  free- 
energy  principle  for  action,  perception,  and  learning. 
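
For a discrete Markov chain, the predictive information between successive steps can be computed directly from the transition matrix. The sketch below is an illustration under stated assumptions: the chain is taken to be in its stationary distribution (found here by simple power iteration from a uniform start), and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(transition, iters=1000):
    """Approximate the stationary distribution by repeated propagation."""
    p = np.full(transition.shape[0], 1.0 / transition.shape[0])
    for _ in range(iters):
        p = p @ transition
    return p

def predictive_information(transition, p_s=None):
    """I(S_{t+1}; S_t) in bits for a discrete Markov chain.

    transition : shape (S, S), transition[i, j] = P(s_{t+1} = j | s_t = i)
    p_s        : shape (S,), distribution of s_t (stationary if omitted)
    """
    if p_s is None:
        p_s = stationary_distribution(transition)
    joint = p_s[:, None] * transition          # P(s_t, s_{t+1})
    p_next = joint.sum(axis=0)                 # P(s_{t+1})
    product = p_s[:, None] * p_next[None, :]   # P(s_t) P(s_{t+1})
    mask = joint > 0                           # skip zero-probability pairs
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# A deterministic two-cycle is fully predictable; an i.i.d. chain is not.
pi_cycle = predictive_information(np.array([[0.0, 1.0], [1.0, 0.0]]))
pi_iid = predictive_information(np.array([[0.5, 0.5], [0.5, 0.5]]))
```

The two toy chains mark the extremes: one bit of predictive information when the next state is fully determined by the current one, and zero when successive states are independent.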


3.4 Information Bottleneck

The information bottleneck method is a technique to
compress an unknown random variable $X$ when a joint
probability distribution between $X$ and an observed
relevant variable $Y$ is given (Tishby et al., 1999). The
compressed variable is $Z$, and the algorithm minimizes the
quantity

$$
\min_{p(z \mid x)} \; I(X; Z) - \beta I(Z; Y)
$$

where $I(X; Z)$ is the mutual information between $X$ and $Z$.

Creutzig et al. (2009) propose to use the information
bottleneck to find the properties of the past that are
relevant and sufficient for predicting the future in
dynamical systems. Adapted to our notation, this
past-future information bottleneck is written as:

$$
\min_{p(\hat{s}_t \mid s_t)} \left\{ I(S_t; \hat{S}_t) - \beta I(\hat{S}_t; S_{t+1}) \right\}
$$

where $S_t$, $S_{t+1}$, $\hat{S}_t$ are respectively the input past, the output
future, and the model future. Given past signal values, a
compressed version of the past is to be formed such that
information about the future is preserved. When varying $\beta$,
we obtain the optimal trade-off curve, also known as the
information curve, between compression and prediction,
which is a more complete characterization of the
complexity of the process.

Creutzig et al. (2009) show that the past-future
information bottleneck method can make the underlying
predictive structure of the process explicit, and capture it
by the states of a dynamical system. From the lifelong
learning point of view, this means that from repeated
observations of the dynamic environment the measure
provides an objective function that the learner can use to
identify the regularity and extract the underlying structures.

4. Learning with Actions

4.1 Interactive Learning

Figure 3: Learning with actions

We  now  consider  the  learning  agents  that  perform  actions 
on  the  environment  to  change  the  states  of  the  environment 
(Figure 3). An example of this paradigm is interactive
learning (Still, 2009). Assume that the learner interacts
with the environment between consecutive observations.
Let one decision epoch consist of mapping the current
history $h_t$, available to the learner at time $t$, onto an action
(sequence) $a_t$ that starts at time $t$ and takes time $\Delta$ to be
executed. The problem of interactive learning is to choose
a model and an action policy that are optimal in that
they maximize the learner’s ability to predict the world,
while being minimally complex.

The decision function, or action policy, is given by the
conditional probability distribution $P(a_t \mid h_t)$. Let the
model summarize historical information via the probability
map $P(s_t \mid h_t)$. The learner uses the current state $s_t$
together with knowledge of the action $a_t$ to make
probabilistic predictions of future observations $s_{t+1}$:

$$
P(s_{t+1} \mid s_t, a_t) = \frac{1}{P(s_t, a_t)}
\left\langle P(s_{t+1} \mid h_t, a_t) \, P(a_t \mid h_t) \, P(s_t \mid h_t) \right\rangle_{P(h_t)}
$$

The interactive learning problem is solved by
maximizing $I(\{S_t, A_t\}; S_{t+1})$ over $P(s_t \mid h_t)$ and $P(a_t \mid h_t)$,
under constraints that select for the simplest possible
model and the most efficient policy, respectively, in terms
of smallest complexity measured by the coding rate. Less
complex models and policies result in less predictive
power. This trade-off can be implemented using Lagrange
multipliers, $\lambda$ and $\mu$. Thus, the optimization problem for
interactive learning (Still, 2009) is given by

$$
\max_{P(s_t \mid h_t),\, P(a_t \mid h_t)}
\left\{ I(\{S_t, A_t\}; S_{t+1}) - \lambda I(S_t; H_t) - \mu I(A_t; H_t) \right\}
$$

where $I(S_t; H_t)$ and $I(A_t; H_t)$ are the coding rates of the
model and the policy, respectively.


Note  that  interactive  learning  is  different  from 
reinforcement  learning,  which  will  be  discussed  in  the  next 
section.  In  contrast  to  reinforcement  learning,  the 
predictive  model  approach  such  as  interactive  learning  asks 
about  behavior  that  is  optimal  with  respect  to  learning 
about  the  environment  rather  than  with  respect  to  fulfilling 
a  specific  task.  This  approach  does  not  require  rewards. 
Conceptually,  the  predictive  approach  could  be  thought  of 
as  “rewarding”  information  gain  and,  hence,  curiosity.  In 
that sense, it is related to curiosity-driven reinforcement
learning (Schmidhuber, 1991; Still & Precup, 2012),
where  internal  rewards  are  given  that  correlate  with  some 
measure  of  prediction  error.  However,  the  learner’s  goal  is 
not  to  predict  future  rewards,  but  rather  to  behave  such  that 
the  time  series  that  it  observes  as  a  consequence  of  its  own 
actions  is  rich  in  causal  structure.  This,  in  turn,  allows  the 
learner  to  construct  a  maximally  predictive  model  of  its 
environment. 


4.2  Empowerment 

Empowerment  measures  how  much  influence  an  agent  has 
on  its  environment.  It  is  an  information-theoretic 
generalization  of  joint  controllability  (influence  on 
environment)  and  observability  (measurement  by  sensors) 
of  the  environment  by  the  agent,  both  controllability  and 
observability  being  usually  defined  in  control  theory  as  the 
dimensionality  of  the  control/observation  spaces  (Jung  et 
al.,  2012). 

Formally, empowerment is defined as the Shannon
channel capacity between $A_t$, the choice of an action
sequence, and $S_{t+1}$, the resulting successor state:

$$
\begin{aligned}
C(s_t) &= \max_{p(a)} I(S_{t+1}; A_t \mid s_t) \\
       &= \max_{p(a)} \left[ H(S_{t+1} \mid s_t) - H(S_{t+1} \mid A_t, s_t) \right]
\end{aligned}
$$

The maximization of the mutual information is with
respect to all possible distributions over $A_t$.
Empowerment measures to what extent an agent can
influence the environment by its actions. It is zero if,
regardless of what the agent does, the outcome will be the
same.  And  it  is  maximal  if  every  action  will  have  a  distinct 
outcome. 
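
For a discrete action set and a fixed current state, the channel capacity in this definition can be approximated numerically. The sketch below uses the Blahut-Arimoto iteration, a standard algorithm for channel capacity that is swapped in here for illustration and is not named in the text; the function names and example channels are hypothetical.

```python
import numpy as np

def _kl_to_marginal(channel, p_next):
    """KL divergence (nats) of each row P(s' | a) from the marginal P(s')."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(channel > 0, np.log(channel / p_next), 0.0)
    return np.sum(channel * log_ratio, axis=1)

def empowerment(channel, iters=200):
    """Approximate max_{p(a)} I(S_{t+1}; A_t | s_t) in bits.

    channel : shape (A, S), channel[a, j] = P(s_{t+1} = j | a_t = a, s_t)
              for one fixed current state s_t
    Returns the capacity estimate and the maximizing action distribution.
    """
    channel = np.asarray(channel, dtype=float)
    p_a = np.full(channel.shape[0], 1.0 / channel.shape[0])
    for _ in range(iters):
        kl = _kl_to_marginal(channel, p_a @ channel)
        p_a = p_a * np.exp(kl)     # favor actions with distinctive outcomes
        p_a /= p_a.sum()
    kl = _kl_to_marginal(channel, p_a @ channel)
    return float(p_a @ kl) / np.log(2.0), p_a

# Every action leads to a distinct outcome: maximal empowerment.
c_distinct, _ = empowerment(np.eye(4))
# The outcome is the same whatever the agent does: zero empowerment.
c_same, _ = empowerment(np.array([[0.5, 0.5], [0.5, 0.5]]))
```

The two example channels reproduce the limiting cases described above: four distinguishable actions give two bits, and actions with identical outcome distributions give zero.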

It should be noted that empowerment is fully specified
by the dynamics of the agent-environment coupling (i.e.
the transition probabilities) and a reward does not need to
be specified. Empowerment provides a natural utility
function which imbues its states with an a priori value,
without an explicit specification of a reward. This enables
the system to keep itself alive indefinitely.

5.  Learning  with  Rewards 

5.1  Markov  Decision  Processes 

In  some  settings  of  lifelong  learning,  the  agent  receives 
feedback  information  from  the  environment.  In  this  case, 
the  agent’s  decision  process  can  be  modeled  as  a  Markov 
decision  process  (MDP).  MDPs  are  a  popular  approach  for 
modeling  sequences  of  decisions  taken  by  an  agent  in  the 
face  of  delayed  accumulation  of  rewards.  The  structure  of 
the  rewards  defines  the  tasks  the  agent  is  supposed  to 
achieve. 

A  standard  approach  to  solving  the  MDP  is 
reinforcement  learning  (Sutton  &  Barto,  1998),  which  is  an 
approximate  dynamic  programming  method.  The  learner 
observes the states $s_t$ of the environment, takes actions $a_t$
on the environment, and gets rewards $r_t$ from it (Figure 4).
This  occurs  sequentially,  i.e.  the  learner  observes  the  next 
states  only  after  it  takes  actions.  An  example  of  this  kind  of 
learner  is  a  mobile  robot  that  sequentially  measures  current 
location,  takes  motions,  and  reduces  the  distance  to  the 
destination.  Another  example  is  a  stock-investment  agent 
that  observes  the  state  of  the  stock  market,  makes  sell/buy 
decisions,  and  gets  payoffs.  It  is  not  difficult  to  imagine 
extending  this  idea  to  develop  a  lifelong  learning  agent  that 
incorporates  external  guidance  and  feedback  from  humans 
or  other  agents  to  accumulate  knowledge  from  experience. 

5.2  Value  Functions 

The goal of reinforcement learning is to maximize the
expected value of the cumulative reward. The reward
function is defined as $R(s_{t+1} \mid s_t, a_t)$ or $r_{t+1} = r(s_t, a_t)$. This
value is obtained by averaging over the transition
probabilities $T(s_{t+1} \mid s_t, a_t)$ and the policy $\pi(a_t \mid s_t)$ or
$a_t = \pi(s_t)$. Given a starting state $s_t$ and a policy $\pi$, the
value of the state $s_t$ following policy $\pi$ can be
expressed via the recursive Bellman equation (Sutton &
Barto, 1998),

$$
V^{\pi}(s_t) = \sum_{a_t \in A} \pi(a_t \mid s_t) \sum_{s_{t+1} \in S}
T(s_{t+1} \mid s_t, a_t) \left[ R(s_{t+1} \mid s_t, a_t) + V^{\pi}(s_{t+1}) \right]
$$

Alternatively, the value function can be defined on
state-action pairs:

$$
Q^{\pi}(s_t, a_t) = \sum_{s_{t+1} \in S}
T(s_{t+1} \mid s_t, a_t) \left[ R(s_{t+1} \mid s_t, a_t) + V^{\pi}(s_{t+1}) \right]
$$

which is the utility attained if, in state $s_t$, the
agent carries out action $a_t$, and after that begins to follow
$\pi$.
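
The two recursions can be sketched as iterative policy evaluation on a small tabular MDP. The example is illustrative only: it assumes an episodic problem with an absorbing, zero-reward terminal state so that the undiscounted sums stay finite, and the function names and the three-state MDP are hypothetical.

```python
import numpy as np

def action_values(v, transition, reward):
    """Q(s, a) = sum_{s'} T(s' | s, a) [R(s' | s, a) + V(s')]."""
    return np.sum(transition * (reward + v[None, None, :]), axis=2)

def evaluate_policy(policy, transition, reward, sweeps=200):
    """Iterate the Bellman equation for V under a fixed policy.

    policy     : shape (S, A), policy[s, a] = pi(a | s)
    transition : shape (S, A, S), transition[s, a, s2] = T(s2 | s, a)
    reward     : shape (S, A, S), reward[s, a, s2] = R(s2 | s, a)
    """
    v = np.zeros(transition.shape[0])
    for _ in range(sweeps):
        v = np.sum(policy * action_values(v, transition, reward), axis=1)
    return v

# Hypothetical chain 0 -> 1 -> 2 with state 2 terminal (absorbing).
# Action 0 advances one state with reward 1; action 1 stays with reward 0.
transition = np.zeros((3, 2, 3))
reward = np.zeros((3, 2, 3))
for s in (0, 1):
    transition[s, 0, s + 1] = 1.0
    reward[s, 0, s + 1] = 1.0
    transition[s, 1, s] = 1.0
transition[2, :, 2] = 1.0

policy = np.zeros((3, 2))
policy[:, 0] = 1.0                      # always advance
v = evaluate_policy(policy, transition, reward)
q = action_values(v, transition, reward)
```

For continuing tasks without a terminal state, a discount factor would normally be added inside the bracket; it is omitted here to match the undiscounted form of the equations above.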

5.3  Information  Costs 

If there are multiple optimal policies, then asking for the
information-theoretically cheapest one among these
optimal policies becomes more interesting. Tishby &
Polani (2010) and Polani (2011) propose to introduce an
information cost term in policy learning. It is even more
interesting if we do not require the solution to be perfectly
optimal. Thus, if we wish the expected reward $E[V(S)]$ to
be sufficiently large, the information cost for such a
suboptimal (but informationally parsimonious) policy will
be generally lower.

For a given utility level, we can use the Lagrangian
formalism to formulate the unconstrained minimization
problem

$$
\min_{\pi} \left\{ I^{\pi}(S_t; A_t) - \beta E[Q^{\pi}(S_t, A_t)] \right\}
$$

where $I^{\pi}(S_t; A_t)$ measures the decision cost incurred by
the agent:

$$
I^{\pi}(S_t; A_t) = \sum_{s_t} P(s_t) \sum_{a_t} \pi(a_t \mid s_t)
\log \frac{\pi(a_t \mid s_t)}{P(a_t)}
$$

where $P(a_t) = \sum_{s_t} \pi(a_t \mid s_t) P(s_t)$. The term
$I^{\pi}(S_t; A_t)$ denotes the information that the action $A_t$
carries about the state $S_t$ under policy $\pi$.
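
The decision cost can be evaluated directly for a tabular policy. The following sketch assumes a known state distribution and reports the cost in bits (the base of the logarithm is not fixed in the text, so base 2 is an assumption); the function name and the example policies are hypothetical.

```python
import numpy as np

def information_cost(policy, p_s):
    """I^pi(S_t; A_t) in bits for a tabular policy.

    policy : shape (S, A), policy[s, a] = pi(a | s)
    p_s    : shape (S,), distribution over states P(s)
    """
    p_a = p_s @ policy                         # P(a) = sum_s pi(a | s) P(s)
    joint = p_s[:, None] * policy              # P(s) pi(a | s)
    marginal = np.broadcast_to(p_a, policy.shape)
    mask = joint > 0                           # skip zero-probability pairs
    return float(np.sum(joint[mask] * np.log2(policy[mask] / marginal[mask])))

p_s = np.array([0.5, 0.5])
# A policy that ignores the state carries no information about it.
cost_blind = information_cost(np.array([[0.5, 0.5], [0.5, 0.5]]), p_s)
# A deterministic state-dependent policy reveals the state completely.
cost_reactive = information_cost(np.array([[1.0, 0.0], [0.0, 1.0]]), p_s)
```

A state-blind policy costs nothing, while a fully state-dependent one costs one bit in this two-state example, which is the sense in which a suboptimal policy can be informationally cheaper.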


5.4  Interestingness  and  Curiosity 

The objective function consisting of the value function and
the information cost can balance the expected return with
minimum cost. However, this lacks any notion of
interestingness (Zhang, 1994) or curiosity (Schmidhuber,
1991). In Section 4 we have seen this aspect being
reflected in the predictive power and empowerment (Jung
et al., 2011). The objective function can be extended by
the predictive power (Still & Precup, 2012). Using
Lagrange multipliers, we can formulate lifelong
learning as an optimization problem:

$$
\arg\max_{q} \left\{ I^{\pi}(\{S_t, A_t\}; S_{t+1}) + \alpha V_t^{\pi}(q) - \lambda I(S_t; A_t) \right\}
$$

where $q(a_t \mid s_t)$ is the action policy to be approximated.
The ability to predict improves the performance of a
learner across a large variety of specific behaviors.

The above objective function embodying the curiosity
terms as well as the value and information cost terms can
thus be an ideal guideline for a lifelong learner. The
predictive power term $I^{\pi}(\{S_t, A_t\}; S_{t+1})$ allows the
agent to actively explore the environment to extract
interesting knowledge. The information cost term $I(S_t; A_t)$
enables the learner to minimize the interaction with the
environment or teacher. This all happens with the goal of
maximizing the value or utility $V_t^{\pi}(q)$ of the information
the agent is acquiring.

6.  Summary  and  Conclusion 

We  have  formulated  lifelong  learning  as  a  sequential, 
online,  incremental  learning  process  over  an  extended 
period  of  time  in  a  dynamic,  changing  environment.  The 
hallmark  of  this  lifelong-learning  framework  is  that  the 
training  data  are  observed  sequentially  and  not  kept  for 
iterative  reuse.  This  requires  instant,  online  model  building 
and  incremental  transfer  of  knowledge  acquired  from 
previous  learning  to  future  learning,  which  can  be 
formulated  as  a  Bayesian  inference. 

The  Bayesian  framework  is  general  enough  to  cover 
the  perception-action  cycle  model  of  cognitive  systems  in 
its  various  instantiations.  We  applied  the  framework  to 
develop  a  taxonomy  of  lifelong  learning  based  on  the  way 
of  obtaining  learning  examples.  We  distinguished  three 
paradigms:  learning  with  observations,  learning  with 
actions,  and  learning  with  rewards.  For  each  of  the 
paradigms  we  examined  the  objective  functions  of  the 
lifelong  learning  styles. 

The  first  paradigm  is  lifelong  learning  with  passive, 
continual  observations.  Typical  examples  are  time  series 
prediction  and  target  tracking  (filtering).  The  objective 
functions for this setting are prediction errors and
predictive information, the latter being defined as the
mutual information between the past and future states in
time series. The information bottleneck method can also be
modified to measure the predictive information.

The second paradigm is lifelong learning with actions
(but without reward feedback). Interactive learning and
empowerment are examples. Here, the learner actively
explores the environment to achieve maximal predictive
power about the environment at minimal complexity. In
this paradigm, the agent acts on the environment
according to its action policy, but does not receive rewards
from the environment for its actions. The goal is
mainly to know more about the world. Simultaneous
localization and mapping (SLAM) in robotics is an
excellent example of the interactive learning problem,
though we found no literature that explicitly formulates
SLAM as interactive learning.

The  third  paradigm  is  active  lifelong  learning  with 
explicit  rewards.  This  includes  the  MDP  problem  for 
which  approximate  dynamic  programming  and 
reinforcement  learning  have  been  extensively  studied.  The 
conventional  objective  function  for  MDPs  is  the  value 
function  or  the  expected  reward  of  the  agent.  As  we  have 
reviewed  in  this  paper,  there  have  been  several  proposals 
recently  to  extend  the  objective  function  by  incorporating 
information-theoretic  factors.  These  objective  functions 
can  be  applied  to  lifelong  learning  agents,  for  example,  to 
attempt  to  minimize  information  costs  while  maximizing 
the  predictive  information  or  curiosity  for  a  given  level  of 
expected  reward  from  the  environment.  These  approaches 
are  motivated  by  information-theoretic  analysis  of  the 
perception-action  cycle  view  of  cognitive  dynamic  systems. 

In  this  article,  we  have  focused  on  the  sequential, 
predictive  learning  aspects  of  lifelong  learning.  This 
framework  is  general  and  thus  can  incorporate  the  classes 
of  lifelong  classification  and  regression  learning.  Since 
these  supervised  learning  problems  do  not  care  about  the 
sequence  of  observations,  the  sequential  formulations 
presented  in  this  paper  can  be  reused  by  ignoring  the 
temporal  dependency.  We  also  did  not  discuss  the  detailed 
mechanisms  of  learning  processes  for  the  lifelong  learning 
framework.  Future  work  should  relate  the  information- 
theoretic  objective  functions  to  the  representations  to 
address questions like “how to discover and revise the
knowledge structures to represent the internal model of the
world or environment” (Zhang, 2008).

As  a  whole,  we  believe  the  general  framework  and  the 
objective  functions  for  lifelong  learning  described  here 
provide  a  baseline  for  evaluating  the  representations  and 
strategies  of  the  learning  algorithms.  Specifically,  the 
objective  functions  can  be  used  for  innovating  algorithms 
for  discovery,  revision,  and  transfer  of  knowledge  of  the 
lifelong  learners  over  the  extended  period  of  experience. 
Our emphasis on information theory-based active and
predictive learning with minimal mechanistic assumptions
on model structures can be especially fruitful for
automated knowledge acquisition and sequential
knowledge transfer between a wide range of similar but
significantly different tasks and domains.